The design and psychometric evaluation of a COVID-19 social stigma questionnaire in nurses

Background The patient’s fear of social reactions, the disease stigma, and being a transmission agent is a psychological and social consequence of contracting some diseases, especially infectious ones, in any society. The present study aimed to design and psychometrically evaluate a COVID-19 social stigma questionnaire in nurses. Methods This mixed-method study was conducted using a sequential exploratory approach according to the Creswell method in Mazandaran Province (Iran) during 2021-22. The study was performed in three phases: (1) a qualitative phase to explain the key concept, (2) designing the scale items, and (3) an experimental phase with the scale psychometric evaluation. In the first phase, nurses’ experiences regarding the concept of COVID-19 social stigma were evaluated using a qualitative method with an inductive qualitative content analysis approach. In this phase, the lived experiences of 12 nurses working at hospitals of Babol University of Medical Sciences were extracted through in-depth interviews with semi-structured questions and analyzed by conventional content analysis. The main classes are contradictory feelings, rejection, and adaptation strategies. In the second phase, the designed items were validated by determining face validity, content validity, and construct validity using exploratory factor analysis (EFA). In addition, the scale’s reliability was determined through internal consistency and stability. Results Following the study’s first phase, a pool of questions with 64 initial items was formed. After evaluating face and content validity, the number of items was reduced to 24 cases. An excellent total content validity (S-CVI/Ave) of 0.93 was calculated for the scale. According to EFA outputs, three factors accounted for the most variance (52.82%), and four items were excluded in this phase. The Kaiser-Meyer-Olkin (KMO) statistic and Bartlett’s test of sphericity were calculated at 0.776 and P < 0.001, respectively. 
The results of Cronbach’s alpha (0.796) and intraclass correlation (0.793) indicated the correlation and internal consistency of the scale. Conclusion This scale can help healthcare managers and policymakers apply necessary protective measures by evaluating the social stigma of COVID-19 in nurses and emerging infectious diseases that may occur in the future. Supplementary Information The online version contains supplementary material available at 10.1186/s12912-023-01620-2.


INTRODUCTION
The estimation of reliability and validity is very important in any research. To achieve the goal of a study, we usually face two questions. The first is "How do we ascertain that we are indeed measuring what we want to measure?" and the second is "If we repeat the measurement, are we sure we will get the same result?" The first question relates to validity and the second to reliability. These two concepts are referred to as psychometric properties.
The term reliability in psychological research refers to the consistency of a research study or measuring test (McLeod, 2007). If findings from research can be replicated consistently, they are reliable. Obtaining exactly the same results is often not feasible because participants and situations vary. However, a strong positive correlation between the results of the same test indicates reliability (Balkin, 2017).
Many definitions of reliability abound in the psychometric literature. According to Wilkinson and Robertson (2006), reliability with respect to research means "repeatability" or "consistency". Reliability can also be defined as the degree to which an assessment tool produces stable and consistent results (Meyer, 2010). Mellenbergh (2011) opined that reliability is the consistency of a test, or the degree to which the test gives consistent results. It is also seen as a measure of a test's precision. Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials.

According to the National Council on Measurement in Education (NCME, 1999), reliability in statistics and psychometrics is the overall consistency of a measure. A measure is said to have high reliability if it produces similar results under consistent conditions. It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are accurate, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained.
According to the standards written by the American Educational Research Association (AERA), American Psychological Association (APA), and the National Council on Measurement in Education (NCME) (2014), reliability refers to the consistency of measurements when a testing process is repeated for an individual or group of individuals.
Reliability is the extent to which a questionnaire, test, observation or any measurement procedure produces the same results on repeated trials (Bolarinwa, 2015). In short, it is the stability or consistency of scores over time or across raters (Miller, 2015). It is worth noting that lack of reliability may arise from divergences between observers or instruments of measurement, or from instability of the attribute being measured (Last, 2015). Nunnally (cited in Bardhoshi et al., 2016) opined that measurements are reliable to the extent that they are repeatable, and that any random influence which tends to make measurements different from occasion to occasion or circumstance to circumstance is a source of measurement error.
According to Kline (2000), reliability as it applies to tests has two distinct meanings. One refers to stability over time, the second to internal consistency. Reliability is the degree to which a test consistently measures whatever it measures. Reliability is an indicator of consistency, that is, an indicator of how stable a test score or data is across applications or time. A measure should produce similar or the same results consistently if it measures the same "thing" (Sawilowsky, 2000). A measure can be reliable without being valid, but a measure cannot be valid without being reliable (Erford, 2013).
The correlation coefficient plays an important role in determining the degree of reliability. A correlation coefficient of +1.0 is regarded as a perfect positive relationship, -1.0 as a perfect negative relationship, and 0.0 indicates no relationship.
The nearer a correlation is to +1.0, the more reliable the results. If a measure is perfectly reliable, there is no error in measurement, that is, everything we observe is true score. Therefore, for a perfectly reliable measure, the reliability = 1. If we have a perfectly unreliable measure, there is no true score, that is, the measure is entirely error. In this case, the reliability = 0. The value of a reliability estimate tells us the proportion of variability in the measure attributable to the true score. A reliability of 0.5 means that about half of the variance of the observed score is attributable to truth and half is attributable to error. According to the American Educational Research Association (AERA), American Psychological Association (APA), and the National Council on Measurement in Education (NCME) (2014), a reliability of 0.8 means the variability is about 80% true ability and 20% error. All measurement procedures involve error. However, it is the amount of error that indicates how reliable a measurement is. When the amount of error is low, the reliability of the measurement is high. Conversely, when the amount of error is large, the reliability of the measurement is low (Erford, 2013; Meyer, 2010).
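The variance interpretation in the preceding paragraph can be written out under classical test theory. The symbols X (observed score), T (true score), E (error) and the reliability coefficient are notation introduced here for illustration; they do not appear in the original text, and the decomposition assumes that true scores and errors are uncorrelated.

```latex
X = T + E, \qquad \operatorname{Var}(X) = \operatorname{Var}(T) + \operatorname{Var}(E)

\rho_{XX'} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
           = 1 - \frac{\operatorname{Var}(E)}{\operatorname{Var}(X)}
```

Under this sketch, a reliability of 0.8 corresponds to Var(T) = 0.8 Var(X) and Var(E) = 0.2 Var(X), which is the 80% true ability and 20% error split described above.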
It is fundamental to note that reliability refers to the results and not to the test itself. The samples from which the reliability coefficients are derived must be representative of the population for whom the test is designed and sufficiently large to be statistically reliable (Leann & Ken, 2012). According to Kline (2000), a reliability of 0.7 is the minimum for a good test. This is because the standard error of measurement (the estimated standard deviation of the error in observed scores) increases as the reliability decreases.
In general, there are four broad types of reliability: test-retest reliability, parallel forms reliability, internal consistency reliability, and inter-rater reliability (Kaplan & Saccuzzo, 2005). In this study, we examine stability (test-retest) and internal consistency (Cronbach alpha).
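The link between reliability and the standard error of measurement can be illustrated with the usual classical test theory formula, SEM = SD * sqrt(1 - reliability). The following is a minimal sketch; the function name and the score standard deviation of 10 points are hypothetical values chosen for illustration, not taken from the study.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical scale whose observed scores have a standard deviation of 10.
# The SEM grows as the reliability coefficient falls.
for rel in (0.9, 0.8, 0.7, 0.5):
    sem = standard_error_of_measurement(10.0, rel)
    print(f"reliability={rel:.1f}  SEM={sem:.2f}")
```

The loop only shows the direction of the relationship; the appropriate minimum reliability still depends on the purpose of the test.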

Test-retest Reliability (or Stability)
Test-retest reliability (also called stability) answers the question, "Will the scores be stable over time?" Test-retest reliability refers to the temporal stability of a test from one measurement session to another. The procedure is to administer the test to a group of respondents and then administer the same test to the same respondents at a later date. The correlation between scores on the identical tests given at different times operationally defines its test-retest reliability. Two assumptions underlie the use of the test-retest procedure (Wells, 2003).
1. The first assumption is that the characteristic being measured does not change over the time period between administrations, called the 'testing effect' (Engel & Schutt, 2013).
2. The second assumption is that the interval, while short, is long enough that respondents' memories of taking the test the first time do not influence their scores on the second and subsequent administrations, called the 'memory effect'.
The estimate of test-retest reliability is also known as the coefficient of stability (Cohen et al., 1996). The test-retest correlation provides an indication of stability over time (Wong, Ong & Kuek, 2012; Pedisic et al., 2014; Deniz & Alsaffar, 2013). In other words, the scores are consistent from the first administration to the second. In using this form of reliability, one needs to be careful with questionnaires or scales that measure variables likely to change over a short period of time, such as energy, happiness and anxiety, because of the maturation effect (Drost, 2011). For well-developed standardized achievement tests administered reasonably close together, test-retest reliability estimates tend to range between 0.70 and 0.90 (Popham, 2000).
Despite its appeal, the test-retest reliability technique has several limitations (Rosenthal & Rosnow, 1991). For instance, when the interval between the first and second test is too short, respondents might remember what was on the first test, and their answers on the second test could be affected by memory. Alternatively, when the interval between the two tests is too long, maturation occurs. Kaplan and Saccuzzo (2005) noted that test-retest reliability estimates evaluate the reliability of instrument scores when an instrument is given at multiple and subsequent points in time. Joppe (2000) identifies a problem with the test-retest method that can make the instrument, to a certain degree, unreliable. She explains that the test-retest method may sensitize the respondent to the subject matter and hence influence the responses given. Similarly, Crocker and Algina (1986) noted that when a respondent answers a set of test items, the score obtained represents only a limited sample of behaviour.
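The test-retest procedure described in this section reduces to correlating total scores from two administrations of the same instrument. The following is a minimal sketch using only the Python standard library; the function name and the two score lists are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical total scores of the same eight respondents at time 1 and time 2.
first_administration = [52, 47, 60, 55, 49, 63, 58, 45]
second_administration = [50, 49, 61, 53, 50, 65, 56, 47]

stability = pearson_r(first_administration, second_administration)
print(f"coefficient of stability r = {stability:.3f}")
```

The coefficient only summarizes agreement in rank and spread between the two occasions; it does not by itself separate memory effects from genuine stability.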

Internal consistency
Internal consistency reliability answers the question, "How well does each item measure the content or construct under consideration?" The appeal of an internal consistency index of reliability is that it is estimated after only one test administration and therefore avoids the problems associated with testing over multiple time periods (Wong, Ong, & Kuek, 2012). The internal consistency reliability estimate refers to the inter-correlations between items on the same instrument (Kaplan & Saccuzzo, 2005). Cronbach's coefficient alpha is one of the most frequently used ways of estimating internal consistency reliability (Dimitrov, 2002). The α coefficient is the most widely used procedure for estimating reliability in applied research. As stated by Sijtsma (2009), its popularity is such that Cronbach (1951) has been cited as a reference more frequently than the article on the discovery of the DNA double helix. Nevertheless, its limitations are well known (Yang & Green, 2011), some of the most important being the assumptions of uncorrelated errors, tau-equivalence and normality.
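Cronbach's alpha can be computed from a single administration using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The following is a minimal sketch with the Python standard library; the function name and the small response matrix are hypothetical and do not come from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Transpose so that each column holds all responses to one item.
    items = list(zip(*rows))
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses of five people to four 4-point Likert items.
responses = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
]
print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
```

Sample variances (n - 1 denominator) are used for both the items and the totals; using population variances throughout would give the same ratio.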

Sample size determination in reliability
The issue of sample size determination has been a major problem for researchers and psychometricians in reliability studies. Existing approaches to determining sample size for psychometric studies are varied and not straightforward. As a result, the psychometric literature contains a wide range of articles that propose a variety of sample sizes (Donner & Eliasziw, 1987; Eliasziw et al., 1994; Cocchetti, 1999; Charter, 1999; Mendoza, Stafford, & Stauffer, 2000; Bonett, 2002). These studies fall into two broad categories: those based on authors' experiences and those based on statistical theory.
In the studies based on judgments from authors' experiences (De Vellis, 1991; Rea & Parker, 1992; Ferguson & Cox, 1993), the sample size recommendations vary widely. Some authors suggested that samples should exceed 300 (Ware et al., 1997), whereas others posited that much smaller samples of as few as 30 subjects may suffice (Rea & Parker, 1992; Bonett & Wright, 2014). The second category of sample size recommendations includes those studies grounded in statistical theory (Feldt et al., 1987; Donner & Eliasziw, 1987; Eliasziw et al., 1994; Bonett, 2002). These differ in their approaches to reliability testing (Charter, 1999; Mendoza et al., 2000), with recommendations ranging from n = 25 (Cocchetti, 1999) to 400 (Charter, 1999). Kline (2000) advised that researchers should use at least 100 participants per item on a scale if the reliability estimate is to be meaningful. Surprising differences of opinion on sample size determination abound in the literature. Some authors suggest that samples as small as thirty (30) can measure reliability, so long as the scale items have strong intercorrelation (Bonett & Wright, 2014). In a related vein, Nunnally & Bernstein (1994) averred that the minimum criteria for reliability coefficients are 0.80 for Cronbach's alpha, 0.30 for item-total correlations, 0.30 for item-item correlations, and 0.80 for intra-class correlation coefficients. Kline (1986) suggested a minimum sample size of 300, as did Nunnally & Bernstein (1994). Segall (1994) called a sample size of 300 "small". Charter (1999) stated that a minimum sample size of 400 was needed for a sufficiently precise estimate of the population coefficient alpha. Charter (2003) opined that with low sample sizes alpha coefficients can be unstable. Walker and Zhang (2004) suggested a minimum sample size of 125 to 150 for calculating reliability, with at least as many people in the sample as items on the test. However, the minimum sample size for the sample coefficient alpha has been frequently debated because of the difficulty of data collection in psychometric research. Although the determination of the sample size needed for reliability studies is somewhat subjective, a minimum of 400 subjects is recommended.
In reliability studies, different authors and researchers use various sample sizes, and there is no uniformity in the sample sizes being used. Sample size plays an important role in estimating the reliability level of a measurement scale.
Correlations, along with most other statistical indices, have standard errors, which indicate how trustworthy the results are. The larger the number of subjects, the smaller the standard error of the statistic (Erford, 2013). This means that it is essential that reliability estimates are derived from a sample sufficiently large to minimize this statistical error (AERA, APA, & NCME, 2014). In reliability testing, determining the right sample size is often critical (Erford, 2013; Meyer, 2010). If the sample size used is too small, not much information can be obtained from the test, which limits one's ability to draw meaningful conclusions. On the other hand, if it is too large, the information obtained through the test may be beyond what is needed, thus incurring unnecessary costs (AERA, APA, & NCME, 2014). Most times, however, test developers do not have the luxury of requesting the number of subjects needed but have to create a test plan based on the budget or resource constraints in place for the project.
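The trade-off between sample size and precision can be explored before data collection. The following is a minimal planning sketch, assuming the Fisher z approximation for a correlation-type coefficient with standard error 1/sqrt(n - 3) and a 1.96 critical value; the function names, the planning value of 0.80 and the target widths are hypothetical choices made for illustration.

```python
import math


def fisher_ci_width(r, n, z_crit=1.96):
    """Width of the Fisher z confidence interval for a coefficient r at sample size n."""
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                    # Fisher z transform of r
    half = z_crit / math.sqrt(n - 3)     # half-width on the z scale
    return math.tanh(z + half) - math.tanh(z - half)


def min_n_for_width(r, target_width, n_max=100_000):
    """Smallest n whose Fisher interval is no wider than target_width."""
    for n in range(4, n_max + 1):
        if fisher_ci_width(r, n) <= target_width:
            return n
    raise ValueError("target width not reachable within n_max")


# Hypothetical planning value: an anticipated reliability of 0.80.
for width in (0.30, 0.20, 0.10):
    print(f"target width {width:.2f} -> n >= {min_n_for_width(0.80, width)}")
```

Because the interval narrows roughly with the square root of n, halving the target width requires about four times as many subjects, which is one way to weigh precision against the budget constraints mentioned above.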

Statement of the Problem
There are surprising differences of opinion in the literature regarding the adequate sample size for establishing the reliability of research instruments. For example, Kline (2000) noted that the standard advice is to use at least 100 participants per item on a scale if the reliability estimate is to be meaningful. On the other hand, Bonett and Wright (2014) asserted that samples as small as thirty (30) can establish reliability so long as the scale items have strong inter-correlation. Moreover, researchers use different sample sizes for establishing reliability estimates when carrying out research studies. Some use 20, 30, 40, 50 or 100 subjects, as the case may be, but no empirical research has been carried out to justify the use of these sample sizes. Also, researchers use different methods to establish the different types of reliability. For example, some use test-retest for questionnaire instruments rather than the popular Cronbach alpha (Vacha-Haase & Thompson, 2010).
Although the topic of reliability has gained much attention in the literature, investigations into sample size requirements remain scarce. It is therefore imperative to examine the test-retest and Cronbach alpha estimates (the most used reliability estimates) of an instrument using various sample sizes.

Research Questions
The following research questions were raised to guide the study.
1. Is there a difference in the test-retest reliability estimate of an instrument using various sample sizes of 20, 30, 40, 50, 100, 150, 200, 300 and 400?
2. Is there a difference in the Cronbach alpha reliability estimate of an instrument using various sample sizes of 20, 30, 40, 50, 100, 150, 200, 300 and 400?

Relevance of the Study
The findings of the study will help psychometricians, educators and researchers to be aware of the minimum sample size for carrying out reliability studies. This will help resolve the problem of choosing the right sample size for an acceptable reliability estimate. It will guide psychometricians and researchers on the method and sample size to use when conducting a reliability study. In the same vein, the findings will help psychometricians and researchers to estimate the proportion of variability in their measurement that is attributable to the true score. That is, it will help them to determine the amount of error, which indicates how reliable a measure is. When the amount of error is low, the reliability of the measurement is high; conversely, when the amount of error is large, the reliability of the measure is low. This study will also benefit researchers and other stakeholders who have difficulty choosing appropriate methods of estimating reliability. Finally, it will help researchers and other stakeholders to report reliability estimates accurately in manuscripts (test manuals, conference papers and articles).

Methods
The survey research design was adopted for the study. The population of this study comprised all the students in public senior secondary schools in Benin metropolis in Edo State. There are seventy-five (75) senior secondary schools with a total of 40,815 students in Benin metropolis. The breakdown is as follows: Egor Local Government Area has 12 schools with 8,207 students; Oredo Local Government Area has 13 schools with 12,154 students; Ikpoba Okha Local Government Area has 27 schools with 15,456 students; and Ovia North East has 23 schools with 4,998 students. The statistics on schools and students were collected from the Ministry of Education, Benin City. A sample of 408 students was selected from thirty-eight (38) senior secondary schools in Benin metropolis. A multistage sampling technique, which involves several sampling stages, was used to select the sample. The instrument for data collection was the Open Hemisphere Brain Dominance Scale 1.0 (OHBDS), a personality scale designed by Eric Jorgenson (2015), which was adapted by the researcher. It consists of two sections. Section A was used to elicit the students' bio data, including their sex and class. Section B consists of a twenty (20) item inventory designed to measure the hypothesized left-brain versus right-brain preference among students on a 4-point Likert scale. The response options are: SA = Strongly Agree, A = Agree, D = Disagree, SD = Strongly Disagree. SD was scored 1 point, D was scored 2 points, A was scored 3 points and SA was scored 4 points. The instrument had been validated by Eric Jorgenson and was also validated by experts in Measurement and Evaluation, University of Benin, Benin City. The reliability of the instrument was among the issues examined in the study.
The reliability coefficient was estimated using the Pearson Product Moment Correlation Coefficient (r) for the instrument subjected to test-retest, and Cronbach alpha for the instrument administered once. Fisher's 95% confidence interval was used to determine which of the sample sizes gave a stable result. The width of the interval for the various sample sizes was determined. The sample size(s) with the narrowest interval was adjudged the most stable and consistent.

RESULTS
Table 1 shows the Fisher 95% confidence intervals of the test-retest reliability estimates of the instrument for sample sizes of 20, 30, 40, 50, 100, 150, 200, 300 and 400. With a sample size of 20, the value of r was 0.55, with a 95% confidence interval of (0.14, 0.80) and a width of 0.66. When the sample was increased to 30, the value became 0.56, with a 95% confidence interval of (0.25, 0.77) and a width of 0.52. A sample size of 40 gave an r value of 0.75, with a 95% confidence interval of (0.57, 0.86) and a width of 0.29. A sample size of 50 gave an r value of 0.79, with a 95% confidence interval of (0.66, 0.88) and a width of 0.22. When the size became 100, the value of r became 0.81, with a 95% confidence interval of (0.73, 0.87) and a width of 0.14. A sample size of 150 gave an r value of 0.85, with a 95% confidence interval of (0.80, 0.89) and a width of 0.09. A sample size of 200 gave an r value of 0.86, with a 95% confidence interval of (0.82, 0.89) and a width of 0.07. A sample size of 300 gave an r value of 0.88, with a 95% confidence interval of (0.85, 0.90) and a width of 0.05. A sample size of 400 gave an r value of 0.88, with a 95% confidence interval of (0.86, 0.90) and a width of 0.04. This is presented in Figure 1.
Table 2 shows the Fisher 95% confidence intervals of the Cronbach alpha reliability estimates of the instrument for sample sizes of 20, 30, 40, 50, 100, 150, 200, 300 and 400. With a sample size of 20, the value of alpha was 0.61, with a 95% confidence interval of (0.23, 0.83) and a width of 0.60. When the sample was increased to 30, the value became 0.69, with a 95% confidence interval of (0.44, 0.84) and a width of 0.40. A sample size of 40 gave an alpha value of 0.78, with a 95% confidence interval of (0.62, 0.87) and a width of 0.26. A sample size of 50 gave an alpha value of 0.80, with a 95% confidence interval of (0.68, 0.89) and a width of 0.21. When the size became 100, the value of alpha became 0.83, with a 95% confidence interval of (0.77, 0.89). A sample size of 150 gave an alpha value of 0.84, with a 95% confidence interval of (0.79, 0.88) and a width of 0.09. A sample size of 200 gave an alpha value of 0.84, with a 95% confidence interval of (0.80, 0.88) and a width of 0.08. A sample size of 300 gave an alpha value of 0.85, with a 95% confidence interval of (0.82, 0.88) and a width of 0.06. A sample size of 400 gave an alpha value of 0.87, with a 95% confidence interval of (0.84, 0.89) and a width of 0.05. This is presented in Figure 2.
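The Fisher 95% confidence intervals reported for the test-retest coefficients can be approximately reproduced with the standard Fisher z transformation. The following is a minimal sketch, assuming a standard error of 1/sqrt(n - 3) on the z scale and a 1.96 critical value; the function name is illustrative, and the (n, r) pairs are the first five test-retest values quoted in the text. Whether the authors used exactly this computation is an assumption, although n = 20 with r = 0.55 reproduces the reported interval of (0.14, 0.80) to two decimals.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Fisher z confidence interval (lower, upper) for a correlation r with n subjects."""
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                    # transform r to the z scale
    half = z_crit / math.sqrt(n - 3)     # half-width on the z scale
    return math.tanh(z - half), math.tanh(z + half)


# (sample size, test-retest r) pairs quoted in the text for Table 1.
reported = [(20, 0.55), (30, 0.56), (40, 0.75), (50, 0.79), (100, 0.81)]

for n, r in reported:
    lower, upper = fisher_ci(r, n)
    print(f"n={n:3d}  r={r:.2f}  CI=({lower:.2f}, {upper:.2f})  width={upper - lower:.2f}")
```

Small discrepancies in the second decimal are to be expected, because the coefficients quoted in the text are themselves rounded.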

DISCUSSION OF FINDINGS
The study revealed that the test-retest estimates from sample sizes of 20 and 30 were not reliable. Sample sizes of 40 and 50, though reliable on the point estimate, had lower bounds below the acceptable test-retest reliability of 0.70 (Kline, 2000). The reliability of the instrument became stronger when the sample size was at least 100. This finding is in line with Leann & Ken (2012), who affirmed that the samples from which the reliability coefficient is derived must be sufficiently large to be statistically reliable. The finding also agrees with Kline (2000), who noted that the standard advice is to use at least 100 participants per item on a scale if the reliability estimate is to be meaningful. In the same vein, the finding is supported by Ware et al. (1997), who asserted that samples should exceed 300. However, the finding disagrees with Bonett & Wright (2014), who asserted that samples as small as thirty (30) can establish reliability so long as the scale items have strong inter-correlation, and with Rea & Parker (1992), who posited that samples of as few as 30 subjects may suffice for test-retest reliability.
The study also revealed that the Cronbach alpha estimates from sample sizes of 20 and 30 were not reliable. Sample sizes of 40 and 50, though reliable on the point estimate, had lower bounds below the 0.80 acceptable reliability coefficient for Cronbach's alpha (Nunnally & Bernstein, 1994). The reliability of the instrument became stronger when the sample size was at least 100. This finding is in line with AERA, APA, & NCME (2014) and Erford (2013), who stated that the larger the number of subjects, the smaller the standard error of the statistic, which means that it is essential that reliability estimates are derived from a sample sufficiently large to minimize this statistical error. The finding also agrees with Kline (1986), who suggested a minimum sample size of 300, as did Nunnally & Bernstein (1994). Segall (1994) called a sample size of 300 "small". Charter (1999) stated that a minimum sample size of 400 was needed for a sufficiently precise estimate of the population coefficient alpha. Charter (2003) also noted that with low sample sizes alpha coefficients can be unstable. Walker and Zhang (2004) suggested a minimum sample size of 125 to 150 for calculating reliability, with at least as many people in the sample as items on the test. However, the finding disagrees with Feldt et al. (1987), Donner & Eliasziw (1987), Eliasziw et al. (1994), Bonett (2002), Charter (1999), Mendoza et al. (2000) and Cocchetti (1999), who recommended sample sizes starting from n = 25.
The difference in the findings of this study could be a result of using observed values from the field. Most of the findings in the literature were derived either from personal experience or from statistical theory. Unfortunately, much of the empirical evidence comes from simulated data, so the resulting recommendations are incomplete, because simulated data have important limitations compared with observed data. They are based on preselected statistical or computer models that can only approximate observed data, have artificially controllable parameters, and are often generated to reflect randomly distributed samples. These limit the inferences that can be drawn from analyzing simulated data and necessitate the collection of observed data to ensure their credibility.
Another revelation from the study is that both the test-retest and Cronbach alpha reliability estimates started converging from a sample size of 100 (see Figures 1 and 2). This implies that for an acceptable reliability study, at least one hundred subjects should be used.
The result of the study also revealed that the interval estimate gave a more informative picture of reliability than the point estimate for all the samples. For example, for the test-retest, a sample of 40 gave a reliability index of 0.75 as a point estimate, but the interval estimate was (0.573, 0.860). The lower bound was below the acceptable reliability index of 0.70. This corroborates AERA, APA, & NCME (2014), who advocated reporting reliability as an interval estimate rather than the point estimate previously used.

CONCLUSION
Based on the findings of this study, the following conclusions emerged. The results demonstrated that a number of differences exist in sample size determination for reliability studies. The use of sample sizes of twenty (20) and thirty (30) was not justified. This could be attributed to the fact that other studies that suggested a minimum of 20 or 30 subjects used simulated data, as against the observed data used in this study.
The larger the number of subjects, the smaller the standard error of the statistic. To minimize this statistical error, it is essential that reliability estimates are derived from a sample that is sufficiently large. The findings of the study have shown that the use of sample sizes of 20 and 30 for a reliability study is not justifiable. They have also shown that for an acceptable reliability study, the sample size should be at least one hundred (100).

RECOMMENDATIONS
Estimating the reliability of a measuring instrument is a task frequently encountered in research. Sample size determination plays a very important role in the estimation of reliability. The larger the sample, the more stable the reliability estimate and the lower its statistical error. Based on this, the following recommendations were made.
1. Observed or field-tested values should always be used in the estimation of the reliability of any measuring instrument.
2. For a high reliability estimate, at least one hundred (100) subjects should be used.
3. Reliability should not be reported as a point estimate, but as an interval estimate.